Left join could use bitmap for left join instead of Vec<bool>#1291
Conversation
| let indices = | ||
| UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| { | ||
| (unmatched ^ visited_left_side.get_bit(v)).then(|| v as u64) | ||
| })); |
There was a problem hiding this comment.
Are we sure this isn't a degradation in performance by doing the check per item?
There was a problem hiding this comment.
You mean the xor? I'm not sure how big of an impact it might have, but I can remove it if you believe it is worth it 🤷♂️
There was a problem hiding this comment.
it would be good to create a simple hash join micro benchmark in datafusion/benches to validate this.
There was a problem hiding this comment.
I will try to find the time to write such benchmark, but is there a possibility I can do it in a separate PR? Are we really that non confident that this PR will improve performance (even if by a bit)?
There was a problem hiding this comment.
Can't speak for @Dandandan. At least for me the memory saving is obvious 👍 But it's unclear to me how much overhead unmatched ^ visited_left_side.get_bit(v) adds. If we can look into the generated machine code and compare, then that would be another way to prove whether it is more efficient than what we have right now.
There was a problem hiding this comment.
I ran the performance tests against this branch (merged to master) and master:
For q13 (which has a left join): at TPCH Scale Factor (SF) 10 (aka around 10GB of data), on a GCP 16core 64GB RAM machine, running Ubuntu 20.04:
My results show no measureable difference
Coverted to parquet via:
cargo run --release --bin tpch -- convert --input ./data --output ./tpch-parquet --format parquetThen benchmarked using
cargo run --release --bin tpch -- benchmark datafusion --mem-table --format parquet --path ./tpch-parquet --query 13On master:
Query 13 iteration 0 took 10017.1 ms
Query 13 iteration 1 took 10638.8 ms
Query 13 iteration 2 took 10010.3 ms
Query 13 avg time: 10222.08 ms
On this branch merged to master:
git checkout boazberman/master
git merge origin/masterQuery 13 iteration 0 took 10438.6 ms
Query 13 iteration 1 took 10409.1 ms
Query 13 iteration 2 took 10030.8 ms
Query 13 avg time: 10292.82 ms
When I ran the same test again a few times, it reported avg times with sigificant deviation
...
Query 13 avg time: 10750.95 ms
...
Query 13 avg time: 10325.13 ms
...
Query 13 avg time: 10460.80 ms
This leads me to conclude the very small reported difference is noise.
Note that Q13 is:
select
c_count,
count(*) as custdist
from
(
select
c_custkey,
count(o_orderkey)
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
) as c_orders (c_custkey, c_count)
group by
c_count
order by
custdist desc,
c_count desc;There was a problem hiding this comment.
Maybe to remove all doubt we could skip the check on each iteration. Something like (untested)
let indices = if unmached {
UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
(!visited_left_side.get_bit(v)).then(|| v as u64)
}))
} else {
UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
(visited_left_side.get_bit(v)).then(|| v as u64)
}));
}There was a problem hiding this comment.
Thanks @alamb for running the benchmark. I agree with your simplification as well 👍
There was a problem hiding this comment.
@boazberman are you willing to make the proposed change? I am also happy to do so
There was a problem hiding this comment.
@alamb I made the proposed changes. Thanks.
|
Thanks for sticking with this @boazberman ❤️ I know it was a long trip |
|
Yay, at last 🚀 |
|
Thank you @boaz-codota |
Which issue does this PR close?
Closes #240.
This is a new version (arrow native) of: #884 (which went stale because I didn't had the time to finish it up until now)
Rationale for this change
Described in the issue.
What changes are included in this PR?
Described in the issue.
Are there any user-facing changes?